Manipulate Dataset
In order to perform exploratory data analysis, we will preprocess the data into two different formats.
Average of all scenarios: to plot the average temperature of each location on a map, the scenarios must be merged into one data point.
Leave all scenarios: for a more detailed analysis, all scenarios are kept separate.
Import module / Set options and theme
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import xml.etree.ElementTree as ET
import plotly.express as px
from scipy.stats import ttest_rel
from statsmodels.stats.weightstats import ttest_ind
import pingouin as pg
from scipy.stats import zscore
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
warnings.filterwarnings("ignore" )
pd.set_option('display.max_columns' , None )
pd.set_option('display.precision' , 10 )
Import cleaned data
df = pd.read_csv('../data/cleaned_df.csv' )
df['Location_ID' ] = df.groupby(['long' , 'lat' ]).ngroup() + 1
Average Scenarios
The Average Scenarios dataset averages all numerical columns across scenarios, producing one row for each Location, Year, and RCP. This dataset is used for EDA and for visualizing overall trends.
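As a minimal sketch of this averaging step, the toy example below collapses several scenarios into one row per location, year, and RCP. The values are made up for illustration, and only a handful of the real grouping columns are used.

```python
import pandas as pd

# Toy data (made-up values): one location and RCP, several scenarios per year.
toy = pd.DataFrame({
    "long": [-110.0, -110.0, -110.0, -110.0],
    "lat": [37.6, 37.6, 37.6, 37.6],
    "year": [2050, 2050, 2050, 2051],
    "RCP": [4.5, 4.5, 4.5, 4.5],
    "scenario": ["sc22", "sc23", "sc24", "sc22"],
    "T_Annual": [11.0, 12.0, 13.0, 12.5],
})

# Drop the scenario label, then average every numeric column within each group.
keys = ["long", "lat", "year", "RCP"]
averaged = (
    toy.drop(columns="scenario")
       .groupby(keys, as_index=False)
       .mean()
)
print(averaged)
```

The result has one row per (location, year, RCP) combination, which is the shape needed to plot a single value per location on a map.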
Clean data (Average Scenarios)
group_list = ['Park' , 'long' , 'lat' , 'veg' , 'year' , 'TimePeriod' , 'RCP' ,'treecanopy' , 'Ann_Herb' , 'Bare' , 'Herb' , 'Litter' , 'Shrub' , 'El' , 'Sa' ,'Cl' , 'RF' , 'Slope' , 'E' , 'S' ]
veg_location = df.drop(labels= 'scenario' ,axis= 1 ).groupby(group_list).mean().reset_index()
numeric_series = pd.to_numeric(veg_location['RCP' ], errors= 'coerce' )
numeric_series
veg_location['RCP' ] = numeric_series.fillna(veg_location['RCP' ])
four = veg_location[veg_location['RCP' ].isin([4.5 ])]
eight = veg_location[veg_location['RCP' ].isin([8.5 ])]
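# Copy the historical rows once per RCP so each pathway has a continuous series;
# .copy() avoids assigning into a view of veg_location.
four_h = veg_location[veg_location['RCP'].isin(['historical'])].copy()
four_h['RCP'] = 4.5
eight_h = veg_location[veg_location['RCP'].isin(['historical'])].copy()
eight_h['RCP'] = 8.5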
df_con = pd.concat([four_h, four, eight_h, eight], ignore_index= True )
df_con['Location_ID' ] = df_con.groupby(['long' , 'lat' ]).ngroup() + 1
df_con.head(5 )
Output: first five rows of df_con (one row per location, year, and RCP). All five rows share Park = NABR, long = -110.0472, lat = 37.60413, veg = Shrubland, TimePeriod = Hist, RCP = 4.5, and Location_ID = 1, and they cover the years 1980 to 1984. The remaining columns hold the scenario-averaged numeric features, some of which are NaN.
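The cleaning step above attaches the historical rows to both RCP pathways so that each pathway forms one continuous time series. A minimal sketch of the same idea on made-up data, assuming RCP is read from the CSV as strings, could look like this:

```python
import pandas as pd

# Toy data (made-up values): one historical row and one future row per RCP.
toy = pd.DataFrame({
    "year": [1980, 2021, 2021],
    "RCP": ["historical", "4.5", "8.5"],
    "T_Annual": [10.0, 11.0, 11.5],
})

is_hist = toy["RCP"] == "historical"

# Future rows: convert the RCP label to a number.
future = toy[~is_hist].assign(RCP=lambda d: d["RCP"].astype(float))

# One copy of the historical rows per RCP pathway. assign() returns a new
# frame, so no view of `toy` is modified in place.
hist_copies = [toy[is_hist].assign(RCP=rcp) for rcp in (4.5, 8.5)]

combined = pd.concat(hist_copies + [future], ignore_index=True)
print(combined)
```

Because the historical rows appear once under each RCP, any statistic computed over the whole frame counts the historical period twice; grouping by RCP first avoids that.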
All Scenarios
Clean data (All Scenarios)
numeric_series = pd.to_numeric(df['RCP' ], errors= 'coerce' )
numeric_series
df['RCP' ] = numeric_series.fillna(df['RCP' ])
four = df[df['RCP' ].isin([4.5 ])]
eight = df[df['RCP' ].isin([8.5 ])]
# Copy the historical rows once per RCP; .copy() avoids assigning into a view of df.
four_h = df[df['RCP'].isin(['historical'])].copy()
four_h['RCP'] = 4.5
eight_h = df[df['RCP'].isin(['historical'])].copy()
eight_h['RCP'] = 8.5
df_orig = pd.concat([four_h, four, eight_h, eight], ignore_index= True )
df_orig['Location_ID' ] = df_orig.groupby(['long' , 'lat' ]).ngroup() + 1
df_orig.head(5 )
Output: first five rows of df_orig (one row per location, year, RCP, and scenario). All five rows share Park = NABR, long = -110.0472, lat = 37.60413, veg = Shrubland, TimePeriod = Hist, RCP = 4.5, scenario = sc1, and Location_ID = 1, and they cover the years 1980 to 1984. The remaining columns hold the numeric features, some of which are NaN.
Data Exploration
Basic Statistics
79 years of prediction (2021~2099)
40 scenarios (sc22~sc61)
2 RCP scenarios (4.5, 8.5)
113 locations
Explanation
The data was collected at 113 locations within Natural Bridges National Monument (the number of unique latitude and longitude combinations).
The dataset is composed of 41 years of historical data and 79 years of predictions. Since there can be only one scenario for past data, all historical data is labeled ‘sc1’, or scenario one.
For the predicted years (2021 to 2099), there are two RCP scenarios, which change only the RCP variable, and 40 scenarios, which simulate 86 other variables.
For each combination of scenarios, a prediction is made at each location for various columns of the dataset, including annual and seasonal precipitation, seasonal dry soil days, seasonal evaporation, seasonal extreme short-term dry stress, and soil water availability, leading to a final prediction of annual and seasonal temperatures.
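The counts listed above can be reproduced with a few `nunique` and `drop_duplicates` calls. The sketch below runs on made-up rows; the column names follow the dataset, while the "Future" label for `TimePeriod` is a placeholder, since only "Hist" appears in the output shown earlier.

```python
import pandas as pd

# Toy data (made-up values); "Future" is a hypothetical TimePeriod label.
toy = pd.DataFrame({
    "long": [-110.0, -110.0, -110.1, -110.1],
    "lat": [37.6, 37.6, 37.7, 37.7],
    "year": [1980, 2021, 2021, 2022],
    "TimePeriod": ["Hist", "Future", "Future", "Future"],
    "scenario": ["sc1", "sc22", "sc22", "sc23"],
})

# A location is a unique (longitude, latitude) pair.
n_locations = toy[["long", "lat"]].drop_duplicates().shape[0]

# Prediction years and scenarios are counted on the non-historical rows only.
future = toy[toy["TimePeriod"] != "Hist"]
n_future_years = future["year"].nunique()
n_future_scenarios = future["scenario"].nunique()

print(n_locations, n_future_years, n_future_scenarios)
```

Run against `df_orig`, the same three expressions should give the location, prediction-year, and scenario counts reported in this section.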
What is RCP?
Representative Concentration Pathways: a group of scenarios in which CO2 emissions are projected, as in the image below.
The dataset includes two RCP scenarios: 4.5 and 8.5.
source : Representative Concentration Pathway. (2024, May 2). In Wikipedia. https://en.wikipedia.org/wiki/Representative_Concentration_Pathway
Location
Where is this data located and what does it look like?
The data points were sampled at Natural Bridges National Monument in Utah. To give a better idea of how temperature and vegetation are distributed, the plots below show two location-based views of the dataset. The first plot shows the average annual temperature at each location in the year 2099. Since predicted temperatures increase over time, the last year of the dataset was chosen for a more dramatic comparison.
The second plot is a scatter plot of vegetation type by location. Comparing the two plots, there is little apparent correlation between vegetation and annual temperature; temperature instead appears to vary with location (latitude, longitude). We will return to this in the following visualizations.
Map Visualizations
# Average annual temperature per location in 2099 (named filtered_df to avoid shadowing the built-in map).
filtered_df = df_con[df_con['year'] == 2099].groupby(['long', 'lat'])['T_Annual'].mean().reset_index()
fig = px.scatter_mapbox(filtered_df, lat= "lat" , lon= "long" , color= "T_Annual" , size= "T_Annual" ,
color_continuous_scale= px.colors.cyclical.IceFire, size_max= 8 , zoom= 11 ,
mapbox_style= "open-street-map" )
fig.update_layout(
title= {
'text' : "<b>Average Temperature (2099) </b>" ,
'y' : 0.97 ,
'x' : 0.5 ,
'xanchor' : 'center' ,
'yanchor' : 'top'
},
margin= {"r" : 0 , "t" : 40 , "l" : 0 , "b" : 0 }
)
fig.show()
# One row per location and vegetation type in 2099.
filtered_df = df_con[df_con['year'] == 2099].groupby(['long', 'lat', 'veg']).size().reset_index()
fig = px.scatter_mapbox(filtered_df, lat="lat", lon="long", color="veg",
color_continuous_scale= px.colors.cyclical.IceFire, size_max= 8 , zoom= 11 ,
mapbox_style= "open-street-map" )
fig.update_layout(
title= {
'text' : "<b>Vegetation Location</b>" ,
'y' : 0.97 ,
'x' : 0.5 ,
'xanchor' : 'center' ,
'yanchor' : 'top'
},
coloraxis_colorbar= {
'title' : 'Vegetation Level'
},
legend= {
'x' : 1 ,
'y' : 0.8 ,
'xanchor' : 'left' ,
'yanchor' : 'middle'
},
margin= {"r" : 0 , "t" : 40 , "l" : 0 , "b" : 0 }
)
fig.update_traces(marker= dict (size= 10 ))
fig.show()
Temperature/Precipitation Trends
The following plots average annual temperature and annual precipitation over all scenarios, locations, and RCPs for each year to show the overall trend of the predictions. Predictions start in 2021, which is marked with a red dashed line in each plot.
The predictions show a steady increase in temperature, while precipitation fluctuates without a clear trend. This suggests that temperature is the more informative variable for RCP scenarios, which deal with CO2 emissions.
Temperature / Precipitation Predictions Overview
filtered_data = df_con.groupby(['year' ])['T_Annual' ].mean().reset_index()
fig = px.line(
data_frame= filtered_data,
x= 'year' ,
y= 'T_Annual' ,
title= '<b>Annual Temperature</b>' ,
labels= {'T_Annual' : 'Annual Temperature' },
line_shape= 'spline'
)
fig.add_shape(
dict (
type = 'line' ,
x0= 2021 ,
y0= filtered_data['T_Annual' ].min ()/ 1.1 ,
x1= 2021 ,
y1= filtered_data['T_Annual' ].max ()* 1.1 ,
line= dict (
color= "Red" ,
width= 2 ,
dash= "dash" ,
),
)
)
fig.add_annotation(
dict (
x= 2021 ,
y= filtered_data['T_Annual' ].max (),
xref= "x" ,
yref= "y" ,
text= "Prediction" ,
showarrow= False ,
font= dict (
size= 12 ,
color= "Red"
),
align= "center" ,
xanchor= "left"
)
)
fig.update_layout(title= {'x' :0.5 })
fig.show()
filtered_data = df_con.groupby(['year' ])['PPT_Annual' ].mean().reset_index()
fig = px.line(
data_frame= filtered_data,
x= 'year' ,
y= 'PPT_Annual' ,
title= '<b>Annual Precipitation</b>' ,
labels={'PPT_Annual': 'Annual Precipitation'},
line_shape= 'spline'
)
fig.add_shape(
dict (
type = 'line' ,
x0= 2021 ,
y0= filtered_data['PPT_Annual' ].min ()/ 1.1 ,
x1= 2021 ,
y1= filtered_data['PPT_Annual' ].max ()* 1.1 ,
line= dict (
color= "Red" ,
width= 2 ,
dash= "dash" ,
),
)
)
fig.add_annotation(
dict (
x= 2021 ,
y= filtered_data['PPT_Annual' ].max (),
xref= "x" ,
yref= "y" ,
text= "Prediction" ,
showarrow= False ,
font= dict (
size= 12 ,
color= "Red"
),
align= "center" ,
xanchor= "left"
)
)
fig.update_layout(title= {'x' :0.5 })
fig.show()
Perspectives to Consider
What are some aspects of the datasets we can slice and dice or drill down to compare and retrieve meaningful insights?
Most numerical features are generated by the scenario of the model that produced the future data, and some numerical features, such as S, E, Slope, RF, Cl, Sa, El, and treecanopy, are fixed for each unique location. The categorical variables are therefore the aspects of the dataset we can compare to find new insights.
Categorical Variables:
The following plots compare the predicted annual temperature across the categories of each of the three categorical variables: RCP, vegetation, and scenario.
Temperature RCP comparison
filtered_data = df_con.groupby(['year' ,'RCP' ])['T_Annual' ].mean().reset_index()
fig = px.line(
data_frame= filtered_data,
x= 'year' ,
y= 'T_Annual' ,
color= 'RCP' ,
title= '<b>Annual Temperature by RCP Type</b>' ,
labels= {'T_Annual' : 'Annual Temperature' },
line_shape= 'spline'
)
fig.update_layout(title= {'x' :0.5 })
fig.add_shape(
dict (
type = 'line' ,
x0= 2021 ,
y0= filtered_data['T_Annual' ].min ()/ 1.1 ,
x1= 2021 ,
y1= filtered_data['T_Annual' ].max ()* 1.1 ,
line= dict (
color= "Red" ,
width= 2 ,
dash= "dash" ,
),
)
)
fig.add_annotation(
dict (
x= 2021 ,
y= filtered_data['T_Annual' ].max (),
xref= "x" ,
yref= "y" ,
text= "Prediction" ,
showarrow= False ,
font= dict (
size= 12 ,
color= "Red"
),
align= "center" ,
xanchor= "left"
)
)
fig.show()
Since RCP deals with CO2 emissions and the 8.5 scenario assumes higher emissions than the 4.5 scenario, annual temperature rises faster under RCP 8.5 than under RCP 4.5, although it increases under both.
Temperature comparison (Vegetation)
filtered_data = df_con[df_con['RCP' ].isin(['historical' , 4.5 ])].groupby(['year' ,'veg' ])['T_Annual' ].mean().reset_index()
fig = px.line(
data_frame= filtered_data,
x= 'year' ,
y= 'T_Annual' ,
color= 'veg' ,
title= '<b>Annual Temperature by Vegetation Type</b>' ,
labels= {'T_Annual' : 'Annual Temperature' }
)
fig.update_layout(title= {'x' :0.5 })
fig.add_shape(
dict (
type = 'line' ,
x0= 2021 ,
y0= filtered_data['T_Annual' ].min ()/ 1.1 ,
x1= 2021 ,
y1= filtered_data['T_Annual' ].max ()* 1.1 ,
line= dict (
color= "Red" ,
width= 2 ,
dash= "dash" ,
),
)
)
fig.add_annotation(
dict (
x= 2021 ,
y= filtered_data['T_Annual' ].max (),
xref= "x" ,
yref= "y" ,
text= "Prediction" ,
showarrow= False ,
font= dict (
size= 12 ,
color= "Red"
),
align= "center" ,
xanchor= "left"
)
)
fig.show()
The vegetation types seem to follow exactly the same prediction pattern, with a fixed difference between them. This may mean that the algorithm that generated the scenario predictions keeps the mean temperatures of the vegetation types a fixed distance apart. Because of this limitation, it is unnecessary to compare vegetation types with each other: we would always get the same difference.
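One way to check the fixed-difference observation numerically is to compute the yearly gap between two vegetation types and look at how much it varies. The sketch below uses made-up numbers, and "Forest" is a hypothetical vegetation label used only for illustration.

```python
import pandas as pd

# Toy data (made-up values); "Forest" is a hypothetical vegetation type.
toy = pd.DataFrame({
    "year": [2021, 2021, 2022, 2022, 2023, 2023],
    "veg": ["Shrubland", "Forest"] * 3,
    "T_Annual": [12.0, 11.0, 12.4, 11.4, 12.9, 11.9],
})

# Yearly mean temperature per vegetation type, one column per type.
yearly = toy.pivot_table(index="year", columns="veg", values="T_Annual", aggfunc="mean")

# If the types really move in lockstep, the gap is (nearly) constant over time.
gap = yearly["Shrubland"] - yearly["Forest"]
print(gap.describe())
```

A standard deviation of the gap close to zero supports the claim that the curves differ only by a constant offset; a clearly non-zero value would mean the vegetation types are worth comparing after all.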
Temperature comparison (scenario)
df_filtered = df_orig[df_orig['TimePeriod' ] != 'Hist' ]
medians = df_filtered.groupby('scenario' )['T_Annual' ].median().reset_index()
medians = medians.sort_values('T_Annual' )
df_sorted = pd.merge(medians['scenario' ], df_filtered, on= 'scenario' , how= 'left' )
fig = px.box(df_sorted, x= 'scenario' , y= 'T_Annual' , color= 'RCP' )
fig.update_layout(
xaxis_tickangle=- 90 ,
title= {
'text' : "<b>Annual Temperature by Scenario</b>" ,
'x' :0.5 ,
'xanchor' : 'center'
}
)
fig.show()
Since we already know that RCP plays a big role in how the algorithm predicts temperature, we group the dataset into RCP 4.5 scenarios and RCP 8.5 scenarios to see whether there is a significant difference. The plot shows that RCP 4.5 corresponds to scenarios 22 to 41 and RCP 8.5 corresponds to scenarios 42 to 61. Some 4.5 scenarios have higher temperatures than some 8.5 scenarios, but since RCP acts as the first drill-down layer of the dataset, we can use the scenario column as the second drill-down layer.
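The scenario-to-RCP mapping read off the plot can also be confirmed directly from the data. The following sketch, on made-up rows, checks that each scenario belongs to exactly one RCP and lists the scenario number range per RCP; the `number` column is a helper introduced here.

```python
import pandas as pd

# Toy data (made-up values) using the dataset's scenario naming scheme.
toy = pd.DataFrame({
    "scenario": ["sc22", "sc22", "sc41", "sc42", "sc61"],
    "RCP": [4.5, 4.5, 4.5, 8.5, 8.5],
    "T_Annual": [11.0, 11.2, 12.0, 12.5, 13.0],
})

# Each scenario should appear under exactly one RCP.
rcp_per_scenario = toy.groupby("scenario")["RCP"].nunique()
one_rcp_each = bool((rcp_per_scenario == 1).all())

# Extract the numeric part of the label ("sc22" -> 22) and take min/max per RCP.
scenario_ranges = (
    toy.assign(number=toy["scenario"].str.extract(r"(\d+)", expand=False).astype(int))
       .groupby("RCP")["number"]
       .agg(["min", "max"])
)
print(one_rcp_each)
print(scenario_ranges)
```

Applied to the non-historical rows of `df_orig`, this would show whether the 22 to 41 and 42 to 61 split holds without relying on the box plot.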
Statistical Significance
Is there a significant difference between different scenarios?
Before we start analyzing our dataset, one final step is to test whether the differences between the scenarios we plan to compare are statistically significant.
The three comparisons we plan on making are as follows:
RCP 8.5 (High) vs RCP 4.5 (Low)
RCP 4.5: Scenario 37 (High) vs Scenario 40 (Low)
RCP 8.5: Scenario 58 (High) vs Scenario 60 (Low)
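The tests below call `pg.ttest` with `correction=True`, which applies Welch's correction for unequal variances. For intuition, here is a minimal standard-library sketch of the Welch statistic and its degrees of freedom; the sample values are made up, and `welch_t` is a helper name introduced here, not part of pingouin.

```python
import math
import statistics


def welch_t(x, y):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for two samples."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)  # sample variances (n - 1)
    se2 = vx / nx + vy / ny  # squared standard error of the mean difference
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2)
    dof = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, dof


# Made-up annual temperatures for a "high" and a "low" group.
high = [13.1, 13.4, 12.9, 13.6, 13.0]
low = [11.8, 12.1, 11.9, 12.3, 12.0]

t_stat, dof = welch_t(high, low)
print(t_stat, dof)
```

The sign of the statistic follows the argument order: a positive value means the first sample has the higher mean, which is how the signs of the t-values reported below should be read.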
T-test for RCP 4.5 and 8.5
data_before = df_orig[(df_orig['RCP' ] == 8.5 ) & (df_orig['TimePeriod' ] != 'Hist' )]['T_Annual' ]
data_after = df_orig[(df_orig['RCP' ] == 4.5 ) & (df_orig['TimePeriod' ] != 'Hist' )]['T_Annual' ]
result = pg.ttest(data_before,
data_after,
correction= True )
T-test for RCP 4.5 and 8.5
t-value: 232.998
95% Confidence Interval: [1.25, 1.27]
p-Value: 0.000
T-test for Scenario 40 vs 37
data_before = df_orig[df_orig['scenario' ] == 'sc40' ]['T_Annual' ]
data_after = df_orig[df_orig['scenario' ] == 'sc37' ]['T_Annual' ]
result = pg.ttest(data_before,
data_after,
correction= True )
T-test for Scenario 40 vs 37
t-value: -157.977
95% Confidence Interval: [-2.51, -2.45]
p-Value: 0.000
T-test for Scenario 60 vs 58
data_before = df_orig[df_orig['scenario' ] == 'sc60' ]['T_Annual' ]
data_after = df_orig[df_orig['scenario' ] == 'sc58' ]['T_Annual' ]
result = pg.ttest(data_before,
data_after,
correction= True )
T-test for Scenario 60 vs 58
t-value: -125.742
95% Confidence Interval: [-3.61, -3.5]
p-Value: 0.000
Conclusion
For our dataset analysis, we will compare the maximum and minimum scenario within each RCP group to analyze which features affect temperature the most. That is, we compare scenario 37 to scenario 40 for RCP 4.5, and scenario 58 to scenario 60 for RCP 8.5.
Next Steps!
Now that we have shown that the differences between the RCP scenarios, and between the highest and lowest scenarios within each RCP group, are all statistically significant, let's dive deeper into the dataset and build visualizations to hypothesize which features correlate with the predicted temperature!